Background and Motivation

As someone considering a career in Data Science and Machine Learning, I have browsed online job ads for data scientists and was overwhelmed by the number of skills required. It would take a very long time to learn the whole repertoire of data science tools.

To help decide which tools to learn first, I decided to look at which data science skills are currently most in demand by employers in major cities in Canada and the U.S.

Just like any sane person would do these days, I googled “web scraping indeed” to see if I could scrape job ads to analyse them. I came across this awesome blog by Jesse Steinweg-Woods, published in March 2015. He looked at the data science tools mentioned in the job ads aggregated by indeed.com for Seattle, Chicago, San Francisco, New York and nationwide.

His findings from the nationwide search suggested that Python is much more in demand than R. This intrigued me: had the trends he found a year ago changed?

Goals

My initial goals were:

  1. To examine the Data Science skills that most employers are looking for in Canada and the U.S.
  2. To rank Canadian and American cities by the number of job openings for Data Scientists

However, as I searched for data scientist job ads in Canada, I came to realize that most cities didn’t have enough ads. So I decided to focus on the national trend for Canada instead.

As for the U.S., I initially wanted to look at the 100 largest cities. During data collection, however, some of those cities had few or no ads. So I decided to focus instead on the following 53 cities that had enough job ads for data scientists.

During my project meeting with David Robinson, who guided me through this project, he suggested that it would also be interesting to compare the trends between data scientists and data analysts. So I decided to look at the skills employers are looking for in other data-driven jobs as well (specifically Data Analysts, Data Architects and Data Engineers).

Data collection

Source of data

I used indeed.com and indeed.ca to gather the job ads data. I also received job ads data from Stack Overflow, thanks to David, who works there as a data scientist! Due to a confidentiality agreement, I cannot share the Stack Overflow data. However, you can find all the data I scraped off Indeed in my GitHub repository.

Web scraping methodology

I wanted to use rvest to scrape data from Indeed. However, after a few days of trying, I couldn’t get the scraping function to work: Indeed is a job search aggregator, and the links it provides go to different company websites with different structures. This was my first time scraping, so I probably didn’t know enough R tools to make it work. You can see a few of my failed attempts in my GitHub repo.

So, I decided to see if I could modify Jesse’s IPython code to produce .csv files of the word counts (names of the data science tools) from the job ads. His code was written to plot the results right after scraping, not to produce .csv files. I contacted Jesse for permission to use his code for this project, and he kindly gave me permission and some tips.

You can find the IPython notebooks that I used to scrape Indeed here.

You may notice that the code in the IPython notebooks is a bit repetitive. I was trying to stick to the DRY (don’t repeat yourself) principle, but I had never programmed in Python before, so making the code neater will be a work in progress for me. In the meantime, this code was enough to collect the data.

Data was collected for job ads that contained exactly “Data Analyst”, “Data Scientist”, “Data Architect” or “Data Engineer” in their text. Since the search was done for each job title in each of the 53 cities mentioned above, I ended up with 212 (53 × 4) .csv files. I had to do quite a bit of wrangling to combine them.

Out of the 53 cities, 14 had fewer than 10 (or no) job postings for data engineers. Therefore, I used 39 cities to compare the skills among Data Analysts, Data Scientists, Data Architects and Data Engineers.

The data collected include, for each job title in each city, the percentage of job ads that mention each of the analysis tools or programming languages listed below, as well as the number of job postings.

  • R
  • Python
  • Java
  • C++
  • Ruby
  • Perl
  • Matlab
  • Javascript
  • Scala
  • Excel
  • Tableau
  • D3.js
  • SAS
  • SPSS
  • D3
  • Hadoop
  • MapReduce
  • Spark
  • Pig
  • Hive
  • Shark
  • Oozie
  • Zookeeper
  • Flume
  • Mahout
  • SQL
  • NoSQL
  • HBase
  • Cassandra
  • MongoDB

Data Wrangling

library(readr)
library(tidyr)
library(dplyr)

Each city has separate .csv files — one each for DAnalysts, DScientists, DArchitects and DEngineers — containing the columns Term (name of the data science tool) and NumPostings (percentage of job ads containing the term). Below, I combine them to analyse the data.

After doing the following lengthy process, I realized that I could have written a function to do it. However, I couldn’t figure it out at the time, and since I’ve already written the .csv files for all the cities, I’m just showing the work I did below. You can skip ahead to “Specific Research Questions”.
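
For reference, the repeated per-city steps below (read four role files, rename, full-join on Term, replace NAs with zero, tag with City/State) could be wrapped in one helper. This is just a minimal sketch — the function name wrangle_city and its arguments are my own — assuming the <City>Skills1–4.csv naming convention used below, with files 1–4 corresponding to Analyst, Scientist, Architect and Engineer:

library(readr)
library(dplyr)
library(tidyr)
library(purrr)

wrangle_city <- function(file_prefix, city, state) {
  roles <- c("DAnalyst", "DScientist", "DArchitect", "DEngineer")
  files <- paste0(file_prefix, "Skills", 1:4, ".csv")
  present <- file.exists(files)  # some cities have no Data Engineer file

  # Read each role's file and rename its count column after the role
  dfs <- map2(files[present], roles[present], function(f, role) {
    df <- read_csv(f)
    names(df) <- c("Term", role)
    df
  })

  # Full-join all roles on Term, then tag with city and state
  combined <- reduce(dfs, full_join, by = "Term")
  combined$City <- city
  combined$State <- state

  # NAs mean the term never appeared in that role's postings
  replace_na(combined, set_names(as.list(rep(0, sum(present))), roles[present]))
}

# e.g. NewYork <- wrangle_city("NewYork", "New York", "NY")
#      write_csv(NewYork, path = "NewYork.csv")
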

You can find the data used for the following wrangling here.

Joining datasets:

#NewYork
NewYork1 <- read_csv("NewYorkSkills1.csv") #.csv for Data Analysts' job ad data
NewYork2 <- read_csv("NewYorkSkills2.csv") #.csv for Data Scientists' job ad data
NewYork3 <- read_csv("NewYorkSkills3.csv") #.csv for Data Architects' job ad data
NewYork4 <- read_csv("NewYorkSkills4.csv") #.csv for Data Engineers' job ad data

names(NewYork1) <- c("Term", "DAnalyst")
names(NewYork2) <- c("Term", "DScientist")
names(NewYork3) <- c("Term", "DArchitect")
names(NewYork4) <- c("Term", "DEngineer")

j1 <- full_join(NewYork1, NewYork2, by="Term") 
j2 <- full_join(NewYork3, NewYork4, by="Term")
NewYork <- full_join(j1, j2, by="Term")

NewYork$City <- "New York"
NewYork$State <- "NY"
NewYork <- replace_na(NewYork, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
#NAs mean the term didn't appear in the job postings, so they are replaced with zero.

write_csv(NewYork, path = "NewYork.csv") 

#Anaheim
Anaheim1 <- read_csv("AnaheimSkills1.csv")
Anaheim2 <- read_csv("AnaheimSkills2.csv")
Anaheim3 <- read_csv("AnaheimSkills3.csv")
Anaheim4 <- read_csv("AnaheimSkills4.csv")

names(Anaheim1) <- c("Term", "DAnalyst")
names(Anaheim2) <- c("Term", "DScientist")
names(Anaheim3) <- c("Term", "DArchitect")
names(Anaheim4) <- c("Term", "DEngineer")

j1 <- full_join(Anaheim1, Anaheim2, by="Term") 
j2 <- full_join(Anaheim3, Anaheim4, by="Term")
Anaheim <- full_join(j1, j2, by="Term")

Anaheim$City <- "Anaheim"
Anaheim$State <- "CA"

Anaheim <- replace_na(Anaheim, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Anaheim, path = "Anaheim.csv") 

#Arlington
Arlington1 <- read_csv("ArlingtonSkills1.csv")
Arlington2 <- read_csv("ArlingtonSkills2.csv")
Arlington3 <- read_csv("ArlingtonSkills3.csv")
Arlington4 <- read_csv("ArlingtonSkills4.csv")

names(Arlington1) <- c("Term", "DAnalyst")
names(Arlington2) <- c("Term", "DScientist")
names(Arlington3) <- c("Term", "DArchitect")
names(Arlington4) <- c("Term", "DEngineer")

j1 <- full_join(Arlington1, Arlington2, by="Term") 
j2 <- full_join(Arlington3, Arlington4, by="Term")
Arlington <- full_join(j1, j2, by="Term")

Arlington$City <- "Arlington"
Arlington$State <- "TX"

Arlington <- replace_na(Arlington, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Arlington, path = "Arlington.csv") 

#Atlanta
Atlanta1 <- read_csv("AtlantaSkills1.csv")
Atlanta2 <- read_csv("AtlantaSkills2.csv")
Atlanta3 <- read_csv("AtlantaSkills3.csv")
Atlanta4 <- read_csv("AtlantaSkills4.csv")

names(Atlanta1) <- c("Term", "DAnalyst")
names(Atlanta2) <- c("Term", "DScientist")
names(Atlanta3) <- c("Term", "DArchitect")
names(Atlanta4) <- c("Term", "DEngineer")

j1 <- full_join(Atlanta1, Atlanta2, by="Term") 
j2 <- full_join(Atlanta3, Atlanta4, by="Term")
Atlanta <- full_join(j1, j2, by="Term")

Atlanta$City <- "Atlanta"
Atlanta$State <- "GA"

Atlanta <- replace_na(Atlanta, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Atlanta, path = "Atlanta.csv") 


#Aurora
Aurora1 <- read_csv("AuroraSkills1.csv")
Aurora2 <- read_csv("AuroraSkills2.csv")
Aurora3 <- read_csv("AuroraSkills3.csv")
Aurora4 <- read_csv("AuroraSkills4.csv")

names(Aurora1) <- c("Term", "DAnalyst")
names(Aurora2) <- c("Term", "DScientist")
names(Aurora3) <- c("Term", "DArchitect")
names(Aurora4) <- c("Term", "DEngineer")

j1 <- full_join(Aurora1, Aurora2, by="Term") 
j2 <- full_join(Aurora3, Aurora4, by="Term")
Aurora <- full_join(j1, j2, by="Term")

Aurora$City <- "Aurora"
Aurora$State <- "CO"

Aurora <- replace_na(Aurora, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Aurora, path = "Aurora.csv") 

#Austin
Austin1 <- read_csv("AustinSkills1.csv")
Austin2 <- read_csv("AustinSkills2.csv")
Austin3 <- read_csv("AustinSkills3.csv")
Austin4 <- read_csv("AustinSkills4.csv")

names(Austin1) <- c("Term", "DAnalyst")
names(Austin2) <- c("Term", "DScientist")
names(Austin3) <- c("Term", "DArchitect")
names(Austin4) <- c("Term", "DEngineer")

j1 <- full_join(Austin1, Austin2, by="Term") 
j2 <- full_join(Austin3, Austin4, by="Term")
Austin <- full_join(j1, j2, by="Term")

Austin$City <- "Austin"
Austin$State <- "TX"

Austin <- replace_na(Austin, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Austin, path = "Austin.csv") 

#Baltimore
Baltimore1 <- read_csv("BaltimoreSkills1.csv")
Baltimore2 <- read_csv("BaltimoreSkills2.csv")
Baltimore3 <- read_csv("BaltimoreSkills3.csv")
Baltimore4 <- read_csv("BaltimoreSkills4.csv")

names(Baltimore1) <- c("Term", "DAnalyst")
names(Baltimore2) <- c("Term", "DScientist")
names(Baltimore3) <- c("Term", "DArchitect")
names(Baltimore4) <- c("Term", "DEngineer")

j1 <- full_join(Baltimore1, Baltimore2, by="Term") 
j2 <- full_join(Baltimore3, Baltimore4, by="Term")
Baltimore <- full_join(j1, j2, by="Term")

Baltimore$City <- "Baltimore"
Baltimore$State <- "MD"

Baltimore <- replace_na(Baltimore, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Baltimore, path = "Baltimore.csv") 


#Boston
Boston1 <- read_csv("BostonSkills1.csv")
Boston2 <- read_csv("BostonSkills2.csv")
Boston3 <- read_csv("BostonSkills3.csv")
Boston4 <- read_csv("BostonSkills4.csv")

names(Boston1) <- c("Term", "DAnalyst")
names(Boston2) <- c("Term", "DScientist")
names(Boston3) <- c("Term", "DArchitect")
names(Boston4) <- c("Term", "DEngineer")

j1 <- full_join(Boston1, Boston2, by="Term") 
j2 <- full_join(Boston3, Boston4, by="Term")
Boston <- full_join(j1, j2, by="Term")

Boston$City <- "Boston"
Boston$State <- "MA"

Boston <- replace_na(Boston, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Boston, path = "Boston.csv") 

#Boulder
Boulder1 <- read_csv("BoulderSkills1.csv")
Boulder2 <- read_csv("BoulderSkills2.csv")
Boulder3 <- read_csv("BoulderSkills3.csv")
Boulder4 <- read_csv("BoulderSkills4.csv")

names(Boulder1) <- c("Term", "DAnalyst")
names(Boulder2) <- c("Term", "DScientist")
names(Boulder3) <- c("Term", "DArchitect")
names(Boulder4) <- c("Term", "DEngineer")

j1 <- full_join(Boulder1, Boulder2, by="Term") 
j2 <- full_join(Boulder3, Boulder4, by="Term")
Boulder <- full_join(j1, j2, by="Term")

Boulder$City <- "Boulder"
Boulder$State <- "CO"

Boulder <- replace_na(Boulder, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Boulder, path = "Boulder.csv") 

#Chandler
a <- read_csv("ChandlerSkills1.csv")
b <- read_csv("ChandlerSkills2.csv")
c <- read_csv("ChandlerSkills3.csv")
d <- read_csv("ChandlerSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Chandler <- full_join(j1, j2, by="Term")

Chandler$City <- "Chandler"
Chandler$State <- "AZ"

Chandler <- replace_na(Chandler, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Chandler, path = "Chandler.csv") 

#Charlotte
a <- read_csv("CharlotteSkills1.csv")
b <- read_csv("CharlotteSkills2.csv")
c <- read_csv("CharlotteSkills3.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")

j1 <- full_join(a, b, by="Term") 
Charlotte <- full_join(j1, c, by="Term")

Charlotte$City <- "Charlotte"
Charlotte$State <- "NC"

Charlotte <- replace_na(Charlotte, list(DAnalyst=0, DScientist=0, DArchitect=0))

write_csv(Charlotte, path = "Charlotte.csv") 

#Chicago
Chicago1 <- read_csv("ChicagoSkills1.csv")
Chicago2 <- read_csv("ChicagoSkills2.csv")
Chicago3 <- read_csv("ChicagoSkills3.csv")
Chicago4 <- read_csv("ChicagoSkills4.csv")

names(Chicago1) <- c("Term", "DAnalyst")
names(Chicago2) <- c("Term", "DScientist")
names(Chicago3) <- c("Term", "DArchitect")
names(Chicago4) <- c("Term", "DEngineer")

j1 <- full_join(Chicago1, Chicago2, by="Term") 
j2 <- full_join(Chicago3, Chicago4, by="Term")
Chicago <- full_join(j1, j2, by="Term")

Chicago$City <- "Chicago"
Chicago$State <- "IL"

Chicago <- replace_na(Chicago, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Chicago, path = "Chicago.csv") 

#Cincinnati
a <- read_csv("CincinnatiSkills1.csv")
b <- read_csv("CincinnatiSkills2.csv")
c <- read_csv("CincinnatiSkills3.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")

j1 <- full_join(a, b, by="Term") 
Cincinnati <- full_join(j1, c, by="Term")

Cincinnati$City <- "Cincinnati"
Cincinnati$State <- "OH"

Cincinnati <- replace_na(Cincinnati, list(DAnalyst=0, DScientist=0, DArchitect=0))

write_csv(Cincinnati, path = "Cincinnati.csv") 

#Columbus
a <- read_csv("ColumbusSkills1.csv")
b <- read_csv("ColumbusSkills2.csv")
c <- read_csv("ColumbusSkills3.csv")
d <- read_csv("ColumbusSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Columbus <- full_join(j1, j2, by="Term")

Columbus$City <- "Columbus"
Columbus$State <- "OH"

Columbus <- replace_na(Columbus, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Columbus, path = "Columbus.csv") 

#Dallas
a <- read_csv("DallasSkills1.csv")
b <- read_csv("DallasSkills2.csv")
c <- read_csv("DallasSkills3.csv")
d <- read_csv("DallasSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Dallas <- full_join(j1, j2, by="Term")

Dallas$City <- "Dallas"
Dallas$State <- "TX"

Dallas <- replace_na(Dallas, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Dallas, path = "Dallas.csv") 

#Denver
a <- read_csv("DenverSkills1.csv")
b <- read_csv("DenverSkills2.csv")
c <- read_csv("DenverSkills3.csv")
d <- read_csv("DenverSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Denver <- full_join(j1, j2, by="Term")

Denver$City <- "Denver"
Denver$State <- "CO"

Denver <- replace_na(Denver, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Denver, path = "Denver.csv") 

#Detroit
a <- read_csv("DetroitSkills1.csv")
b <- read_csv("DetroitSkills2.csv")
c <- read_csv("DetroitSkills3.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")

j1 <- full_join(a, b, by="Term") 
Detroit <- full_join(j1, c, by="Term")

Detroit$City <- "Detroit"
Detroit$State <- "MI"

Detroit <- replace_na(Detroit, list(DAnalyst=0, DScientist=0, DArchitect=0))

write_csv(Detroit, path = "Detroit.csv") 

#Durham
a <- read_csv("DurhamSkills1.csv")
b <- read_csv("DurhamSkills2.csv")
c <- read_csv("DurhamSkills3.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")

j1 <- full_join(a, b, by="Term") 
Durham <- full_join(j1, c, by="Term")

Durham$City <- "Durham"
Durham$State <- "NC"

Durham <- replace_na(Durham, list(DAnalyst=0, DScientist=0, DArchitect=0))

write_csv(Durham, path = "Durham.csv") 

#Fremont
a <- read_csv("FremontSkills1.csv")
b <- read_csv("FremontSkills2.csv")
c <- read_csv("FremontSkills3.csv")
d <- read_csv("FremontSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Fremont <- full_join(j1, j2, by="Term")

Fremont$City <- "Fremont"
Fremont$State <- "CA"

Fremont <- replace_na(Fremont, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Fremont, path = "Fremont.csv") 

#FortWorth
a <- read_csv("FortWorthSkills1.csv")
b <- read_csv("FortWorthSkills2.csv")
c <- read_csv("FortWorthSkills3.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")

j1 <- full_join(a, b, by="Term") 
FortWorth <- full_join(j1, c, by="Term")

FortWorth$City <- "Fort Worth"
FortWorth$State <- "TX"

FortWorth <- replace_na(FortWorth, list(DAnalyst=0, DScientist=0, DArchitect=0))

write_csv(FortWorth, path = "FortWorth.csv") 

#Garland
a <- read_csv("GarlandSkills1.csv")
b <- read_csv("GarlandSkills2.csv")
c <- read_csv("GarlandSkills3.csv")
d <- read_csv("GarlandSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Garland <- full_join(j1, j2, by="Term")

Garland$City <- "Garland"
Garland$State <- "TX"

Garland <- replace_na(Garland, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Garland, path = "Garland.csv") 

#Gilbert
a <- read_csv("GilbertSkills1.csv")
b <- read_csv("GilbertSkills2.csv")
c <- read_csv("GilbertSkills3.csv")
d <- read_csv("GilbertSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Gilbert <- full_join(j1, j2, by="Term")

Gilbert$City <- "Gilbert"
Gilbert$State <- "AZ"

Gilbert <- replace_na(Gilbert, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Gilbert, path = "Gilbert.csv") 

#Glendale
a <- read_csv("GlendaleSkills1.csv")
b <- read_csv("GlendaleSkills2.csv")
c <- read_csv("GlendaleSkills3.csv")
d <- read_csv("GlendaleSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Glendale <- full_join(j1, j2, by="Term")

Glendale$City <- "Glendale"
Glendale$State <- "AZ"

Glendale <- replace_na(Glendale, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Glendale, path = "Glendale.csv")

#Hialeah
a <- read_csv("HialeahSkills1.csv")
b <- read_csv("HialeahSkills2.csv")
c <- read_csv("HialeahSkills3.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")

j1 <- full_join(a, b, by="Term") 
Hialeah <- full_join(j1, c, by="Term")

Hialeah$City <- "Hialeah"
Hialeah$State <- "FL"

Hialeah <- replace_na(Hialeah, list(DAnalyst=0, DScientist=0, DArchitect=0))

write_csv(Hialeah, path = "Hialeah.csv") 

#Houston
a <- read_csv("HoustonSkills1.csv")
b <- read_csv("HoustonSkills2.csv")
c <- read_csv("HoustonSkills3.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")

j1 <- full_join(a, b, by="Term") 
Houston <- full_join(j1, c, by="Term")

Houston$City <- "Houston"
Houston$State <- "TX"

Houston <- replace_na(Houston, list(DAnalyst=0, DScientist=0, DArchitect=0))

write_csv(Houston, path = "Houston.csv") 

#Irvine
a <- read_csv("IrvineSkills1.csv")
b <- read_csv("IrvineSkills2.csv")
c <- read_csv("IrvineSkills3.csv")
d <- read_csv("IrvineSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Irvine <- full_join(j1, j2, by="Term")

Irvine$City <- "Irvine"
Irvine$State <- "CA"

Irvine <- replace_na(Irvine, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Irvine, path = "Irvine.csv") 

#Irving
a <- read_csv("IrvingSkills1.csv")
b <- read_csv("IrvingSkills2.csv")
c <- read_csv("IrvingSkills3.csv")
d <- read_csv("IrvingSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Irving <- full_join(j1, j2, by="Term")

Irving$City <- "Irving"
Irving$State <- "TX"

Irving <- replace_na(Irving, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Irving, path = "Irving.csv") 

#Jersey
a <- read_csv("JerseySkills1.csv")
b <- read_csv("JerseySkills2.csv")
c <- read_csv("JerseySkills3.csv")
d <- read_csv("JerseySkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Jersey <- full_join(j1, j2, by="Term")

Jersey$City <- "Jersey City"
Jersey$State <- "NJ"

Jersey <- replace_na(Jersey, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Jersey, path = "Jersey.csv") 

#LongBeach
a <- read_csv("LongBeachSkills1.csv")
b <- read_csv("LongBeachSkills2.csv")
c <- read_csv("LongBeachSkills3.csv")
d <- read_csv("LongBeachSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
LongBeach <- full_join(j1, j2, by="Term")

LongBeach$City <- "Long Beach"
LongBeach$State <- "CA"

LongBeach <- replace_na(LongBeach, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(LongBeach, path = "LongBeach.csv") 

#LosAngeles
a <- read_csv("LosAngelesSkills1.csv")
b <- read_csv("LosAngelesSkills2.csv")
c <- read_csv("LosAngelesSkills3.csv")
d <- read_csv("LosAngelesSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
LosAngeles <- full_join(j1, j2, by="Term")

LosAngeles$City <- "Los Angeles"
LosAngeles$State <- "CA"

LosAngeles <- replace_na(LosAngeles, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(LosAngeles, path = "LosAngeles.csv") 

#Mesa
a <- read_csv("MesaSkills1.csv")
b <- read_csv("MesaSkills2.csv")
c <- read_csv("MesaSkills3.csv")
d <- read_csv("MesaSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Mesa <- full_join(j1, j2, by="Term")

Mesa$City <- "Mesa"
Mesa$State <- "AZ"

Mesa <- replace_na(Mesa, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Mesa, path = "Mesa.csv") 

#Miami
a <- read_csv("MiamiSkills1.csv")
b <- read_csv("MiamiSkills2.csv")
c <- read_csv("MiamiSkills3.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")

j1 <- full_join(a, b, by="Term") 
Miami <- full_join(j1, c, by="Term")

Miami$City <- "Miami"
Miami$State <- "FL"

Miami <- replace_na(Miami, list(DAnalyst=0, DScientist=0, DArchitect=0))

write_csv(Miami, path = "Miami.csv") 

#Minneapolis
a <- read_csv("MinneapolisSkills1.csv")
b <- read_csv("MinneapolisSkills2.csv")
c <- read_csv("MinneapolisSkills3.csv")
d <- read_csv("MinneapolisSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Minneapolis <- full_join(j1, j2, by="Term")

Minneapolis$City <- "Minneapolis"
Minneapolis$State <- "MN"

Minneapolis <- replace_na(Minneapolis, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Minneapolis, path = "Minneapolis.csv") 

#Nashville
a <- read_csv("NashvilleSkills1.csv")
b <- read_csv("NashvilleSkills2.csv")
c <- read_csv("NashvilleSkills3.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")

j1 <- full_join(a, b, by="Term") 
Nashville <- full_join(j1, c, by="Term")

Nashville$City <- "Nashville"
Nashville$State <- "TN"

Nashville <- replace_na(Nashville, list(DAnalyst=0, DScientist=0, DArchitect=0))

write_csv(Nashville, path = "Nashville.csv") 

#Newark
a <- read_csv("NewarkSkills1.csv")
b <- read_csv("NewarkSkills2.csv")
c <- read_csv("NewarkSkills3.csv")
d <- read_csv("NewarkSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Newark <- full_join(j1, j2, by="Term")

Newark$City <- "Newark"
Newark$State <- "NJ"

Newark <- replace_na(Newark, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Newark, path = "Newark.csv") 

#Oakland
a <- read_csv("OaklandSkills1.csv")
b <- read_csv("OaklandSkills2.csv")
c <- read_csv("OaklandSkills3.csv")
d <- read_csv("OaklandSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Oakland <- full_join(j1, j2, by="Term")

Oakland$City <- "Oakland"
Oakland$State <- "CA"

Oakland <- replace_na(Oakland, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Oakland, path = "Oakland.csv") 

#Philadelphia
a <- read_csv("PhiladelphiaSkills1.csv")
b <- read_csv("PhiladelphiaSkills2.csv")
c <- read_csv("PhiladelphiaSkills3.csv")
d <- read_csv("PhiladelphiaSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Philadelphia <- full_join(j1, j2, by="Term")

Philadelphia$City <- "Philadelphia"
Philadelphia$State <- "PA"

Philadelphia <- replace_na(Philadelphia, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Philadelphia, path = "Philadelphia.csv") 

#Pittsburgh
a <- read_csv("PittsburghSkills1.csv")
b <- read_csv("PittsburghSkills2.csv")
c <- read_csv("PittsburghSkills3.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")

j1 <- full_join(a, b, by="Term") 
Pittsburgh <- full_join(j1, c, by="Term")

Pittsburgh$City <- "Pittsburgh"
Pittsburgh$State <- "PA"

Pittsburgh <- replace_na(Pittsburgh, list(DAnalyst=0, DScientist=0, DArchitect=0))

write_csv(Pittsburgh, path = "Pittsburgh.csv") 

#Phoenix
a <- read_csv("PhoenixSkills1.csv")
b <- read_csv("PhoenixSkills2.csv")
c <- read_csv("PhoenixSkills3.csv")
d <- read_csv("PhoenixSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Phoenix <- full_join(j1, j2, by="Term")

Phoenix$City <- "Phoenix"
Phoenix$State <- "AZ"

Phoenix <- replace_na(Phoenix, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Phoenix, path = "Phoenix.csv") 

#Plano
a <- read_csv("PlanoSkills1.csv")
b <- read_csv("PlanoSkills2.csv")
c <- read_csv("PlanoSkills3.csv")
d <- read_csv("PlanoSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Plano <- full_join(j1, j2, by="Term")

Plano$City <- "Plano"
Plano$State <- "TX"

Plano <- replace_na(Plano, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Plano, path = "Plano.csv") 

#Portland
a <- read_csv("PortlandSkills1.csv")
b <- read_csv("PortlandSkills2.csv")
c <- read_csv("PortlandSkills3.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")

j1 <- full_join(a, b, by="Term") 
Portland <- full_join(j1, c, by="Term")

Portland$City <- "Portland"
Portland$State <- "OR"

Portland <- replace_na(Portland, list(DAnalyst=0, DScientist=0, DArchitect=0))

write_csv(Portland, path = "Portland.csv") 

#Raleigh
a <- read_csv("RaleighSkills1.csv")
b <- read_csv("RaleighSkills2.csv")
c <- read_csv("RaleighSkills3.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")

j1 <- full_join(a, b, by="Term") 
Raleigh <- full_join(j1, c, by="Term")

Raleigh$City <- "Raleigh"
Raleigh$State <- "NC"

Raleigh <- replace_na(Raleigh, list(DAnalyst=0, DScientist=0, DArchitect=0))

write_csv(Raleigh, path = "Raleigh.csv") 

#SanDiego
a <- read_csv("SanDiegoSkills1.csv")
b <- read_csv("SanDiegoSkills2.csv")
c <- read_csv("SanDiegoSkills3.csv")
d <- read_csv("SanDiegoSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
SanDiego <- full_join(j1, j2, by="Term")

SanDiego$City <- "San Diego"
SanDiego$State <- "CA"

SanDiego <- replace_na(SanDiego, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(SanDiego, path = "SanDiego.csv") 

#SanFrancisco
a <- read_csv("SanFranciscoSkills1.csv")
b <- read_csv("SanFranciscoSkills2.csv")
c <- read_csv("SanFranciscoSkills3.csv")
d <- read_csv("SanFranciscoSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
SanFrancisco <- full_join(j1, j2, by="Term")

SanFrancisco$City <- "San Francisco"
SanFrancisco$State <- "CA"

SanFrancisco <- replace_na(SanFrancisco, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(SanFrancisco, path = "SanFrancisco.csv") 

#SanJose
a <- read_csv("SanJoseSkills1.csv")
b <- read_csv("SanJoseSkills2.csv")
c <- read_csv("SanJoseSkills3.csv")
d <- read_csv("SanJoseSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
SanJose <- full_join(j1, j2, by="Term")

SanJose$City <- "San Jose"
SanJose$State <- "CA"

SanJose <- replace_na(SanJose, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(SanJose, path = "SanJose.csv") 

#SantaAna
a <- read_csv("SantaAnaSkills1.csv")
b <- read_csv("SantaAnaSkills2.csv")
c <- read_csv("SantaAnaSkills3.csv")
d <- read_csv("SantaAnaSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
SantaAna <- full_join(j1, j2, by="Term")

SantaAna$City <- "Santa Ana"
SantaAna$State <- "CA"

SantaAna <- replace_na(SantaAna, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(SantaAna, path = "SantaAna.csv") 

#Scottsdale
a <- read_csv("ScottsdaleSkills1.csv")
b <- read_csv("ScottsdaleSkills2.csv")
c <- read_csv("ScottsdaleSkills3.csv")
d <- read_csv("ScottsdaleSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Scottsdale <- full_join(j1, j2, by="Term")

Scottsdale$City <- "Scottsdale"
Scottsdale$State <- "AZ"

Scottsdale <- replace_na(Scottsdale, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Scottsdale, path = "Scottsdale.csv") 

#Seattle
a <- read_csv("SeattleSkills1.csv")
b <- read_csv("SeattleSkills2.csv")
c <- read_csv("SeattleSkills3.csv")
d <- read_csv("SeattleSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Seattle <- full_join(j1, j2, by="Term")

Seattle$City <- "Seattle"
Seattle$State <- "WA"

Seattle <- replace_na(Seattle, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Seattle, path = "Seattle.csv") 

#St.Louis
a <- read_csv("St.LouisSkills1.csv")
b <- read_csv("St.LouisSkills2.csv")
c <- read_csv("St.LouisSkills3.csv")
d <- read_csv("St.LouisSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
St.Louis <- full_join(j1, j2, by="Term")

St.Louis$City <- "St. Louis"
St.Louis$State <- "MO"

St.Louis <- replace_na(St.Louis, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(St.Louis, path = "St.Louis.csv") 

#St.Paul
a <- read_csv("St.PaulSkills1.csv")
b <- read_csv("St.PaulSkills2.csv")
c <- read_csv("St.PaulSkills3.csv")
d <- read_csv("St.PaulSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
St.Paul <- full_join(j1, j2, by="Term")

St.Paul$City <- "St. Paul"
St.Paul$State <- "MN"

St.Paul <- replace_na(St.Paul, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(St.Paul, path = "St.Paul.csv") 

#St.Petersburg
a <- read_csv("St.PetersburgSkills1.csv")
b <- read_csv("St.PetersburgSkills2.csv")
c <- read_csv("St.PetersburgSkills3.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")

j1 <- full_join(a, b, by="Term") 
St.Petersburg <- full_join(j1, c, by="Term")

St.Petersburg$City <- "St. Petersburg"
St.Petersburg$State <- "FL"

St.Petersburg <- replace_na(St.Petersburg, list(DAnalyst=0, DScientist=0, DArchitect=0))

write_csv(St.Petersburg, path = "St.Petersburg.csv") 

#Tampa
a <- read_csv("TampaSkills1.csv")
b <- read_csv("TampaSkills2.csv")
c <- read_csv("TampaSkills3.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")

j1 <- full_join(a, b, by="Term") 
Tampa <- full_join(j1, c, by="Term")

Tampa$City <- "Tampa"
Tampa$State <- "FL"

Tampa <- replace_na(Tampa, list(DAnalyst=0, DScientist=0, DArchitect=0))

write_csv(Tampa, path = "Tampa.csv") 

#Washington
a <- read_csv("WashingtonSkills1.csv")
b <- read_csv("WashingtonSkills2.csv")
c <- read_csv("WashingtonSkills3.csv")
d <- read_csv("WashingtonSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Washington <- full_join(j1, j2, by="Term")

Washington$City <- "Washington"
Washington$State <- "DC"

Washington <- replace_na(Washington, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Washington, path = "Washington.csv") 
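Each city block above repeats the same read-rename-join-write steps, so they could be collapsed into one helper. A minimal sketch, assuming the `<City>Skills<n>.csv` naming convention used above (`wrangle_city` is a hypothetical name, not from the original code):

```r
# Sketch: generalize the per-city wrangling above. Assumes the
# "<City>Skills<n>.csv" naming convention, with the job-title order
# (1 = DAnalyst, 2 = DScientist, 3 = DArchitect, 4 = DEngineer)
# matching the manual blocks. `wrangle_city` is a hypothetical helper.
library(readr)
library(dplyr)
library(tidyr)

wrangle_city <- function(prefix, city, state, n_files = 4) {
  titles <- c("DAnalyst", "DScientist", "DArchitect", "DEngineer")[1:n_files]
  dfs <- lapply(seq_len(n_files), function(i) {
    df <- read_csv(paste0(prefix, "Skills", i, ".csv"))
    names(df) <- c("Term", titles[i])
    df
  })
  #Full-join all job-title data frames on Term, as in the manual blocks
  out <- Reduce(function(x, y) full_join(x, y, by = "Term"), dfs)
  out$City <- city
  out$State <- state
  #Replace NAs with 0 for every job-title column present
  out <- replace_na(out, as.list(setNames(rep(0, n_files), titles)))
  write_csv(out, paste0(prefix, ".csv"))
  out
}

# e.g. Seattle <- wrangle_city("Seattle", "Seattle", "WA")
# e.g. Tampa   <- wrangle_city("Tampa", "Tampa", "FL", n_files = 3)
```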

All of the .csv files created above can be found here.

Combining all cities:

master_39cities <- union(Anaheim, Arlington) %>% union(Atlanta) %>% 
  union(Aurora) %>% union(Austin) %>% union(Baltimore) %>% 
  union(Boston) %>% union(Boulder) %>% union(Chicago) %>% 
  union(Dallas) %>% union(Denver) %>% union(Fremont) %>% 
  union(Garland) %>% union(Irving) %>% union(Jersey) %>% 
  union(LongBeach) %>% union(LosAngeles) %>% union(Minneapolis) %>% 
  union(Newark) %>% union(NewYork) %>% union(Oakland) %>% 
  union(Philadelphia) %>% union(Plano) %>% union(SanFrancisco) %>% 
  union(SanJose) %>% union(SantaAna) %>% union(Seattle) %>% 
  union(St.Louis) %>% union(St.Paul) %>% union(Washington) %>% 
  union(Chandler) %>% union(Columbus) %>% union(Gilbert) %>% 
  union(Glendale) %>% union(Irvine) %>% union(Mesa) %>% 
  union(Phoenix) %>% union(SanDiego) %>% union(Scottsdale)

write_csv(master_39cities, path="master_39cities.csv")
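The long chain of `union()` calls can also be written as a single `bind_rows()` over a list; a sketch assuming all 39 data frames share the same columns. Note that `union()` drops duplicate rows while `bind_rows()` keeps everything, which should not matter here since each city's rows are distinct:

```r
# Sketch: combine the 39 per-city data frames in one step instead of a
# chain of union() calls. Assumes every data frame has the same columns
# (Term, DAnalyst, DScientist, DArchitect, DEngineer, City, State).
library(dplyr)

city_dfs <- list(Anaheim, Arlington, Atlanta, Aurora, Austin, Baltimore,
                 Boston, Boulder, Chicago, Dallas, Denver, Fremont,
                 Garland, Irving, Jersey, LongBeach, LosAngeles,
                 Minneapolis, Newark, NewYork, Oakland, Philadelphia,
                 Plano, SanFrancisco, SanJose, SantaAna, Seattle,
                 St.Louis, St.Paul, Washington, Chandler, Columbus,
                 Gilbert, Glendale, Irvine, Mesa, Phoenix, SanDiego,
                 Scottsdale)

master_39cities <- bind_rows(city_dfs)
```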

After the lengthy process of wrangling and combining the data, I now have three .csv files:

Specific Research questions

After the data collection process, I narrowed down my research questions to the following.

Exploratory Analysis

library(ggplot2)
library(ggrepel)
library(gridExtra)
library(stringr)
library(wordcloud)
library(RColorBrewer)
library(tm)
library(SnowballC)  

Ranking of American cities based on job ads for Data Analysts, Data Scientists, Data Architects & Data Engineers

Wrangling of num_posting data:

num_posting <- read_csv("num_posting_US.csv") 
#dataset that contains the number of job postings for each job title in each of the 53 cities

num_posting <- num_posting %>% 
  separate(col=`City, State`, into=c("City", "State"), sep=",") 

names(num_posting) <- c("City", "State", "DAnalyst", "DScientist", "DArchitect", "DEngineer")
num_posting <- replace_na(num_posting, list(DEngineer=0)) #The NAs for DEngineer come from cities with fewer than 10 job postings; for ease of analysis, I treat them as zero. 

num_posting_long <- num_posting %>% 
  gather(job_title, num_posting, DAnalyst:DEngineer) #tidy data

Plot of number of job postings for each job title in 53 cities:

num_posting %>% 
  ggplot() +
  geom_line(aes(City, DAnalyst, group=1, color="Data Analyst")) +
  geom_line(aes(City, DScientist, group=1, color="Data Scientist")) +
  geom_line(aes(City, DArchitect, group=1, color="Data Architect")) +
  geom_line(aes(City, DEngineer, group=1, color="Data Engineer")) +
  ylab("Number of job postings") +
  ggtitle("Number of job ads in 53 U.S. cities") +
  theme( text=element_text(size=22), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

You can see that in almost all 53 cities, Data Analysts had the highest number of job ads, followed by Data Scientists; the exceptions were San Francisco, San Jose and Seattle.

There were similar numbers of job ads for data engineers and data architects, except in New York, Newark, Jersey City, San Jose, San Francisco, Seattle, Oakland and Fremont, where there were significantly more job ads for data engineers than for data architects.

Plot of total number of job postings (for Data Analysts, Data Scientists, Data Architects & Data Engineers) in 53 cities:

num_posting_long %>% 
  group_by(City) %>%
  summarize(total_num_posting = sum(num_posting)) %>% 
  ggplot(aes(City, total_num_posting))+
  geom_point(color="purple") +
  ylab("Number of job postings") +
  ggtitle("Total number of job postings for Data Analysts, Data Scientists, Data Architects & Data Engineers") +
  theme( text=element_text(size=22), 
        axis.text.x=element_text(angle=90, vjust=1), 
        legend.text = element_text(size = 20))

In Newark, Jersey City, New York, Oakland, Washington, Fremont and San Francisco, there were over 1000 job ads for Data Analysts, Data Scientists, Data Architects & Data Engineers combined. San Jose had over 800, while Boston & Seattle had over 500.

num_posting_long %>% 
  group_by(City) %>%
  summarize(total_num_posting = sum(num_posting)) %>% 
  tbl_df() %>% 
  arrange(desc(total_num_posting)) 
## Source: local data frame [53 x 2]
## 
##             City total_num_posting
##            (chr)             (dbl)
## 1         Newark              1410
## 2   Jersey City               1394
## 3       New York              1333
## 4        Oakland              1066
## 5     Washington              1059
## 6        Fremont              1041
## 7  San Francisco              1014
## 8       San Jose               817
## 9         Boston               657
## 10       Seattle               562
## ..           ...               ...

Comparing demand of data science tools between Data analysts, Data scientists, Data architects & Data engineers in Canada (nationwide)

Creating a master dataset for Canada:

You can find the datasets for Canada here.

canada1 <- read_csv("canada1.csv")
canada2 <- read_csv("canada2.csv")
canada3 <- read_csv("canada3.csv")
canada4 <- read_csv("canada4.csv")

names(canada1) <- c("Term", "DAnalyst")
names(canada2) <- c("Term", "DScientist")
names(canada3) <- c("Term", "DArchitect")
names(canada4) <- c("Term", "DEngineer")

j1 <- full_join(canada1, canada2, by="Term") 
j2 <- full_join(canada3, canada4, by="Term")
canada <- full_join(j1, j2, by="Term")

canada <- replace_na(canada, replace=list(DAnalyst=0, DScientist=0, DArchitect=0, DEngineer=0)) 

canada_long <- canada %>% 
  gather(job_title, percentage, DAnalyst: DEngineer) #tidy

Plot of data science tools vs. Job title for Canada

canada %>% 
  ggplot() +
  geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst")) +
  geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
  geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
  geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) +
  xlab("") + 
  ylab("Percentage appearing in job ads") +
  ggtitle("Data Science Tools in demand in Canada") +
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

For data analysts, 43.2% of job ads asked for SQL, 41.2% for Excel, 16.1% for R, 16.1% for Tableau, 12.3% for Python and 10.5% for SAS.

canada %>% 
  select(Term, DAnalyst) %>% 
  arrange(desc(DAnalyst))
## Source: local data frame [27 x 2]
## 
##          Term  DAnalyst
##         (chr)     (dbl)
## 1         SQL 43.222506
## 2       Excel 41.176471
## 3           R 16.112532
## 4     Tableau 16.112532
## 5      Python 12.276215
## 6         SAS 10.485934
## 7  JavaScript 10.230179
## 8        Perl  8.695652
## 9        Java  8.439898
## 10     Hadoop  6.649616
## ..        ...       ...

For data scientists, 34.4% of job ads asked for R, 34.4% for Python, 30.5% for SQL, 23.4% for Java, 19.5% for Tableau, 18.8% for Hadoop and 11.7% for Excel.

canada %>% 
  select(Term, DScientist) %>% 
  arrange(desc(DScientist))
## Source: local data frame [27 x 2]
## 
##          Term DScientist
##         (chr)      (dbl)
## 1           R   34.37500
## 2      Python   34.37500
## 3         SQL   30.46875
## 4        Java   23.43750
## 5     Tableau   19.53125
## 6      Hadoop   18.75000
## 7         SAS   17.96875
## 8       Spark   17.18750
## 9  JavaScript   13.28125
## 10      Excel   11.71875
## ..        ...        ...

For data architects, 32.7% asked for SQL, 13.9% for Java, 13.9% for Hadoop, 11.9% for Tableau, 11.9% for Python and only 5.9% for R.

canada %>% 
  select(Term, DArchitect) %>% 
  arrange(desc(DArchitect))
## Source: local data frame [27 x 2]
## 
##          Term DArchitect
##         (chr)      (dbl)
## 1         SQL  32.673267
## 2        Java  13.861386
## 3      Hadoop  13.861386
## 4     Tableau  11.881188
## 5      Python  11.881188
## 6  JavaScript   9.900990
## 7         C++   7.920792
## 8        Hive   7.920792
## 9         SAS   6.930693
## 10          R   5.940594
## ..        ...        ...

For data engineers, 66.7% asked for Python, 63.0% for Hadoop, 40.7% for Spark, 37.0% for SQL & Scala, and 18.5% for Java.

canada %>% 
  select(Term, DEngineer) %>% 
  arrange(desc(DEngineer))
## Source: local data frame [27 x 2]
## 
##      Term DEngineer
##     (chr)     (dbl)
## 1  Python  66.66667
## 2  Hadoop  62.96296
## 3   Spark  40.74074
## 4     SQL  37.03704
## 5   Scala  37.03704
## 6    Hive  33.33333
## 7    Java  18.51852
## 8   HBase  18.51852
## 9   NoSQL  14.81481
## 10    Pig  14.81481
## ..    ...       ...

Comparing demand of data science tools between Data analysts, Data scientists, Data architects & Data engineers in USA (nationwide)

Creating a master dataset for USA nationwide count:

You can find the datasets for US here

usa1 <- read_csv("national1.csv")
usa2 <- read_csv("national2.csv")
usa3 <- read_csv("national3.csv")
usa4 <- read_csv("national4.csv")

names(usa1) <- c("Term", "DAnalyst")
names(usa2) <- c("Term", "DScientist")
names(usa3) <- c("Term", "DArchitect")
names(usa4) <- c("Term", "DEngineer")

j1 <- full_join(usa1, usa2, by="Term") 
j2 <- full_join(usa3, usa4, by="Term")
usa <- full_join(j1, j2, by="Term")

USA <- replace_na(usa, replace=list(DAnalyst=0, DScientist=0, DArchitect=0, DEngineer=0)) 

USA_long <- USA %>% 
  gather(job_title, percentage, DAnalyst:DEngineer) #tidy

Plot of data science tools vs. Job title for USA

USA %>% 
  ggplot() +
  geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst")) +
  geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
  geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
  geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) +
  xlab("") + 
  ylab("Percentage appearing in job ads") +
  ggtitle("Data Science Tools in demand in USA") +
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

For data analysts, 56.2% of job ads asked for Excel, 27.5% for SQL, 12.5% for Tableau, 12.4% for R, 10.3% for Java and 5.5% for SAS.

USA %>% 
  select(Term, DAnalyst) %>% 
  arrange(desc(DAnalyst))
## Source: local data frame [30 x 2]
## 
##          Term  DAnalyst
##         (chr)     (dbl)
## 1       Excel 56.221854
## 2         SQL 27.470965
## 3     Tableau 12.491112
## 4           R 12.396302
## 5        Java 10.263096
## 6         SAS  5.451529
## 7      Hadoop  3.602749
## 8      Python  2.939085
## 9        SPSS  2.844276
## 10 JavaScript  2.630955
## ..        ...       ...

For data scientists, 37.7% of job ads asked for Python, 36.6% for R, 29.4% for SQL, 27.0% for Hadoop, 24.9% for Java, 18.7% for Matlab, 12.8% for SAS and only 4.6% for Excel.

USA %>% 
  select(Term, DScientist) %>% 
  arrange(desc(DScientist)) 
## Source: local data frame [30 x 2]
## 
##      Term DScientist
##     (chr)      (dbl)
## 1  Python   37.65892
## 2       R   36.56291
## 3     SQL   29.41692
## 4  Hadoop   27.04954
## 5    Java   24.85752
## 6  Matlab   18.67602
## 7   Spark   13.15213
## 8     SAS   12.80140
## 9    Hive   12.71372
## 10    C++   11.74923
## ..    ...        ...

For data architects, 45.2% asked for SQL, 18.0% for Hadoop, 17.2% for Java, 16.1% for Tableau, 9.1% for Python, 7.9% for R.

USA %>% 
  select(Term, DArchitect) %>% 
  arrange(desc(DArchitect))
## Source: local data frame [30 x 2]
## 
##          Term DArchitect
##         (chr)      (dbl)
## 1         SQL  45.209581
## 2      Hadoop  17.964072
## 3        Java  17.215569
## 4     Tableau  16.092814
## 5       Spark  10.179641
## 6       NoSQL  10.029940
## 7      Python   9.131737
## 8        Hive   8.532934
## 9           R   7.859281
## 10 JavaScript   7.784431
## ..        ...        ...

For data engineers, 47.5% asked for SQL, 36.6% for Hadoop, 30.6% for Python, 28.7% for Java and 9.1% for R.

USA %>% 
  select(Term, DEngineer) %>% 
  arrange(desc(DEngineer)) 
## Source: local data frame [30 x 2]
## 
##      Term DEngineer
##     (chr)     (dbl)
## 1     SQL  47.48031
## 2  Hadoop  36.53543
## 3  Python  30.62992
## 4    Java  28.66142
## 5   Spark  20.78740
## 6    Hive  17.79528
## 7   NoSQL  12.99213
## 8   HBase  11.96850
## 9     Pig  11.10236
## 10  Scala  11.02362
## ..    ...       ...

For data analysts, Excel seems to be more popular in the U.S. than in Canada. Over 40% of Canadian ads asked for SQL, compared to only 27.5% in the U.S. In the U.S., fewer than 15% of ads asked for R and only about 3% for Python.
For data scientists, Python edges out R in the U.S. by about 1.1 percentage points, while the two are essentially tied in Canada. SQL is in demand in both countries.

Comparing demand of data science tools between Data analysts, Data scientists, Data architects & Data engineers in 39 US cities

Now, let’s see whether the trends for individual U.S. cities differ from the nationwide data.

Loading the datasets with the individual cities’ data:

master_39cities <- read_csv("master_39cities.csv")
master_39cities_long <- master_39cities %>% 
  gather(job_title, percentage, DAnalyst: DEngineer) #tidydata

Now, let’s look at the data separately for each job title.

Data Science tools in demand for Data Analysts in 39 cities

g1 <- master_39cities %>% 
  ggplot(aes(Term, DAnalyst)) + 
  geom_boxplot(color="red") +
  xlab("") +
  ylab("Percentage of appearance in job ads") +
  ggtitle("Data Science Tools in demand for Data Analysts")+
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

Data Science Tools in demand for Data Scientists in 39 cities

g2 <- master_39cities %>% 
  ggplot(aes(Term, DScientist)) + 
  geom_boxplot(color="purple") +
  xlab("") +
  ylab("Percentage of appearance in job ads") +
  ggtitle("Data Science Tools in demand for Data Scientists")+
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

Data Science Tools in demand for Data Architects in 39 cities

g3 <- master_39cities %>% 
  ggplot(aes(Term, DArchitect)) + 
  geom_boxplot(color="green") +
  xlab("") +
  ylab("Percentage of appearance in job ads") +
  ggtitle("Data Science Tools in demand for Data Architects")+
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

Data Science Tools in demand for Data Engineers in 39 cities

g4 <- master_39cities %>% 
  ggplot(aes(Term, DEngineer)) + 
  geom_boxplot(color="turquoise") +
  xlab("") +
  ylab("Percentage of appearance in job ads") +
  ggtitle("Data Science Tools in demand for Data Engineers")+
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))
g1

g2

g3

g4

Comparing demand of Data Science Tools in different U.S. cities

Newark vs. Jersey City vs. New York vs. Washington

g1<- master_39cities%>% 
  filter (City == "New York" | City=="Newark" | City =="Jersey" | City =="Washington") %>%
  ggplot() +
  geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
  geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
  geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
  geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) + 
  facet_wrap(~City) +
  xlab("") +
  ylab("Percentage of appearance in job postings") +
  ggtitle("Percentage of job ads with a data science tool")+
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

Oakland vs. Boston vs. San Francisco vs. Fremont

g2 <- master_39cities%>% 
  filter (City == "Boston" | City=="Oakland" | City =="San Francisco" | City =="Fremont") %>%
  ggplot() +
  geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
  geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
  geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
  geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) + 
  facet_wrap(~City) +
  xlab("") +
  ylab("Percentage of appearance in job postings") +
  ggtitle("Percentage of job ads with a data science tool")+
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

Los Angeles vs. Atlanta vs. Long Beach vs. Irving

g3 <- master_39cities%>% 
  filter (City == "Los Angeles" | City=="Atlanta" | City =="LongBeach" | City =="Irving") %>%
  ggplot() +
  geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
  geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
  geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
  geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) + 
  facet_wrap(~City) +
  xlab("") +
  ylab("Percentage of appearance in job postings") +
  ggtitle("Percentage of job ads with a data science tool")+
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

San Jose vs. Chicago vs. Anaheim vs. Dallas

g4 <- master_39cities%>% 
  filter (City == "San Jose" | City=="Chicago" | City =="Anaheim" | City =="Dallas") %>%
  ggplot() +
  geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
  geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
  geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
  geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) + 
  facet_wrap(~City) +
  xlab("") +
  ylab("Percentage of appearance in job postings") +
  ggtitle("Percentage of job ads with a data science tool")+
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

Garland vs. Plano vs. Arlington vs. Philadelphia

g5 <- master_39cities%>% 
  filter (City == "Garland" | City=="Plano" | City =="Arlington" | City =="Philadelphia") %>%
  ggplot() +
  geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
  geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
  geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
  geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) + 
  facet_wrap(~City) +
  xlab("") +
  ylab("Percentage of appearance in job postings") +
  ggtitle("Percentage of job ads with a data science tool")+
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

Seattle vs. Minneapolis vs. St. Paul vs. Baltimore

g6 <- master_39cities%>% 
  filter (City == "Seattle" | City=="Minneapolis" | City =="St. Paul" | City =="Baltimore") %>%
  ggplot() +
  geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
  geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
  geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
  geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) + 
  facet_wrap(~City) +
  xlab("") +
  ylab("Percentage of appearance in job postings") +
  ggtitle("Percentage of job ads with a data science tool")+
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

Denver vs. Santa Ana vs. St. Louis vs. Aurora

g7 <- master_39cities%>% 
  filter (City == "Denver" | City=="Santa Ana" | City =="St. Louis" | City =="Aurora") %>%
  ggplot() +
  geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
  geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
  geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
  geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) + 
  facet_wrap(~City) +
  xlab("") +
  ylab("Percentage of appearance in job postings") +
  ggtitle("Percentage of job ads with a data science tool")+
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

Mesa vs. Phoenix vs. Scottsdale vs. Irvine

g8 <- master_39cities%>% 
  filter (City == "Mesa" | City=="Phoenix" | City =="Scottsdale" | City =="Irvine") %>%
  ggplot() +
  geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
  geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
  geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
  geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) + 
  facet_wrap(~City) +
  xlab("") +
  ylab("Percentage of appearance in job postings") +
  ggtitle("Percentage of job ads with a data science tool")+
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

Chandler vs. Gilbert vs. Glendale vs. Boulder

g9 <- master_39cities%>% 
  filter (City == "Chandler" | City=="Gilbert" | City =="Glendale" | City =="Boulder") %>%
  ggplot() +
  geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
  geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
  geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
  geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) + 
  facet_wrap(~City) +
  xlab("") +
  ylab("Percentage of appearance in job postings") +
  ggtitle("Percentage of job ads with a data science tool")+
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

Austin vs. Columbus vs. San Diego

g10 <- master_39cities%>% 
  filter (City == "Austin" | City=="Columbus" | City =="SanDiego") %>%
  ggplot() +
  geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
  geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
  geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
  geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) + 
  facet_wrap(~City, nrow=2) +
  xlab("") +
  ylab("Percentage of appearance in job postings") +
  ggtitle("Percentage of job ads with a data science tool")+
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))
g1

g2

g3

g4

g5

g6

g7

g8

g9

g10

For data scientists, Python is winning over R in West Coast cities such as San Francisco, San Jose, Oakland & Seattle. The East Coast was more evenly split: Jersey City & Boston job ads asked for Python more, while New York and Newark asked for R more.

For data analysts, almost all indeed job ads from 39 US cities asked for Excel & SQL.

More job ads asked for Python and Java for data engineers than for data architects.

Stackoverflow data (362 job ads for Data Scientists)

The stackoverflow data had 362 job ads, with about 60% of them from the U.S. and 40% from other countries around the world.

Let’s see if the trends that we observed above for U.S. are similar “internationally”.

load("data_scientist.rda")

stackoverflow <- data_scientist %>% 
  select(JobId, Title, Tags, LocationString, Description, CountryCode, StateCode)
Description <- stackoverflow %>% 
  select(Description)
data_scientist %>% select(CountryCode) %>% filter(CountryCode=="US") %>% nrow() #223 job ads from US
## [1] 223
data_scientist %>% select(CountryCode) %>% filter(CountryCode!="US") %>% nrow() #139 job ads from other countries around the world
## [1] 139

Creating general functions to count the terms (names of the data science tools)

count1 <- function(term, lang) { #Counting just one variation of the term
  lang = str_count(Description$Description, term) %>% data.frame()
  names(lang) <- "count"
  lang <- lang %>% mutate(count=ifelse(count == 0, 0, 1)) 
  #Scenario 1: If the count is 0, the term didn't appear in the job ad. 
  #Scenario 2: If the count is not zero, the term appeared one or more times in the job ad. 
  #Since we are only interested in the proportion of job ads that contain the term, we set count = 1 in Scenario 2. 
  lang
}

count2 <- function(term1, term2, lang) { #Counting 2 variations of the term
  a = str_count(Description$Description, term1) %>% data.frame()
  names(a) <- "count"
  b = str_count(Description$Description, term2) %>% data.frame()
  names(b) <- "count"
  lang <- a + b
  lang <- lang %>% mutate(count=ifelse(count== 0, 0, 1))
  lang
}

count3 <- function(term1, term2, term3, lang) { #Counting 3 variations of the term
  a = str_count(Description$Description, term1) %>% data.frame()
  names(a) <- "count"
  b = str_count(Description$Description, term2) %>% data.frame()
  names(b) <- "count"
  c = str_count(Description$Description, term3) %>% data.frame()
  names(c) <- "count"
  lang <- a + b + c
  lang <- lang %>% mutate(count=ifelse(count== 0, 0, 1))
  lang
}

count7 <- function(term1, term2, term3, term4, term5, term6, term7, lang) { #Counting 7 variations of the term
  a = str_count(Description$Description, term1) %>% data.frame()
  names(a) <- "count"
  b = str_count(Description$Description, term2) %>% data.frame()
  names(b) <- "count"
  c = str_count(Description$Description, term3) %>% data.frame()
  names(c) <- "count"
  d = str_count(Description$Description, term4) %>% data.frame()
  names(d) <- "count"
  e = str_count(Description$Description, term5) %>% data.frame()
  names(e) <- "count"
  f = str_count(Description$Description, term6) %>% data.frame()
  names(f) <- "count"
  g = str_count(Description$Description, term7) %>% data.frame()
  names(g) <- "count"
  lang <- a + b + c + d + e + f + g
  lang <- lang %>% mutate(count=ifelse(count== 0, 0, 1))
  lang
}
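The `count1`/`count2`/`count3`/`count7` helpers differ only in how many variations of a term they accept, so a single variadic function could cover all of them. A sketch (the name `count_terms` is my own, not from the original post):

```r
# Sketch: one counter for any number of term variations, replacing
# count1/count2/count3/count7. Flags each job ad that mentions at least
# one variation. `count_terms` is a hypothetical name.
library(dplyr)
library(stringr)

count_terms <- function(...) {
  terms <- c(...)
  #Sum the per-ad counts across all variations of the term
  total <- Reduce(`+`, lapply(terms, function(t) str_count(Description$Description, t)))
  #As above, we only care whether the term appeared at all, so cap at 1
  data.frame(count = ifelse(total == 0, 0, 1))
}

# e.g. count_terms("Python") corresponds to a count1(...) call, and
# count_terms("Javascript", "JavaScript") to a count2(...) call
```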

count3_fixed3 <- function(term1, term2, term3, lang){ #Counting 3 fixed variations of the term
  a <- grepl(term1, Description$Description) %>% data.frame()
  #When I tried to count R, `str_count` counted every "R" inside words, not just the standalone word "R". 
  #Therefore, I switched to grepl, which lets me match the exact term. 
  names(a) <- "count"
  a <- a %>% mutate(count=ifelse(count==TRUE, 1, 0))
  b <- grepl(term2, Description$Description) %>% data.frame()
  names(b) <- "count"
  b <- b %>% mutate(count=ifelse(count==TRUE, 1, 0))
  c <- grepl(term3, Description$Description) %>% data.frame()
  names(c) <- "count"
  c <- c %>% mutate(count=ifelse(count==TRUE, 1, 0))
  lang <- a + b + c
  lang <- lang %>% mutate(count = ifelse(count == 0, 0, 1))
  lang
}

count6_fixed6 <- function(term1, term2, term3, term4, term5, term6, lang){ #Counting 6 fixed variations of the term
  a <- grepl(term1, Description$Description) %>% data.frame() 
  names(a) <- "count"
  a <- a %>% mutate(count=ifelse(count==TRUE, 1, 0))
  b <- grepl(term2, Description$Description) %>% data.frame()
  names(b) <- "count"
  b <- b %>% mutate(count=ifelse(count==TRUE, 1, 0))
  c <- grepl(term3, Description$Description) %>% data.frame()
  names(c) <- "count"
  c <- c %>% mutate(count=ifelse(count==TRUE, 1, 0))
  d <- grepl(term4, Description$Description) %>% data.frame()
  names(d) <- "count"
  d <- d %>% mutate(count=ifelse(count==TRUE, 1, 0))
  e <- grepl(term5, Description$Description) %>% data.frame()
  names(e) <- "count"
  e <- e %>% mutate(count=ifelse(count==TRUE, 1, 0))
  f <- grepl(term6, Description$Description) %>% data.frame()
  names(f) <- "count"
  f <- f %>% mutate(count=ifelse(count==TRUE, 1, 0))
  lang <- a + b + c + d + e + f
  lang <- lang %>% mutate(count = ifelse(count == 0, 0, 1))
  lang
}

count9_fixed5 <- function(term1, term2, term3, term4, #terms 1-5 are fixed word-boundary patterns,
                   term5, term6, term7, term8, term9, lang) { #terms 6-9 are plain substrings
  a <- grepl(term1, Description$Description) %>% data.frame()
  names(a) <- "count"
  a <- a %>% mutate(count=ifelse(count==TRUE, 1, 0))
  b <- grepl(term2, Description$Description) %>% data.frame()
  names(b) <- "count"
  b <- b %>% mutate(count=ifelse(count==TRUE, 1, 0))
  c <- grepl(term3, Description$Description) %>% data.frame()
  names(c) <- "count"
  c <- c %>% mutate(count=ifelse(count==TRUE, 1, 0))
  d <- grepl(term4, Description$Description) %>% data.frame()
  names(d) <- "count"
  d <- d %>% mutate(count=ifelse(count==TRUE, 1, 0))
  e <- grepl(term5, Description$Description) %>% data.frame()
  names(e) <- "count"
  e <- e %>% mutate(count=ifelse(count==TRUE, 1, 0))
  f = str_count(Description$Description, term6) %>% data.frame()
  names(f) <- "count"
  g = str_count(Description$Description, term7) %>% data.frame()
  names(g) <- "count"
  h = str_count(Description$Description, term8) %>% data.frame()
  names(h) <- "count"
  i = str_count(Description$Description, term9) %>% data.frame()
  names(i) <- "count"
  lang <- a + b + c + d + e + f + g + h + i
  lang <- lang %>% mutate(count = ifelse(count == 0, 0, 1))
  lang
}
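The five counting helpers above differ only in how many patterns they accept, and none of them actually uses its `lang` argument (R's lazy evaluation is why passing a not-yet-defined name at the call sites below doesn't raise an error). As a sketch, not part of the original analysis, a single vectorized helper could replace them all:

```r
#Sketch of a single replacement for count2/count3/count7/
#count3_fixed3/count6_fixed6/count9_fixed5: takes any number of
#regular-expression patterns and returns a one-column data frame
#with count = 1 if a description matches at least one pattern, else 0.
count_any <- function(patterns, text = Description$Description) {
  hit <- Reduce(`|`, lapply(patterns, function(p) grepl(p, text))) #TRUE if any pattern matches
  data.frame(count = as.integer(hit))
}
```

With this helper the calls below would become, for example, `Python <- count_any(c("Python", "python"))` and `R <- count_any(c("\\<R\\>", "\\<R.\\>", "\\<R,\\>"))`; `grepl` suffices throughout because only presence or absence of the term is kept.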

Creating stack_count, which contains one column per term indicating whether a particular job ad mentioned it (yes = 1, no = 0)

R <- count3_fixed3("\\<R\\>", "\\<R.\\>", "\\<R,\\>", R) 
Python <- count2("Python", "python", Python) 
Java <- count2("Java", "java", Java)
C_plus_plus <- count2("C\\+\\+", "c\\+\\+", C_plus_plus)
Ruby <- count2("Ruby", "ruby", Ruby)
Perl <- count2("Perl", "perl", Perl)
Matlab <- count2("Matlab", "matlab", Matlab)
Javascript <- count2("Javascript", "javascript", Javascript)
Scala <- count2("Scala", "scala", Scala)
Excel <- count2("Excel", "excel",Excel)
Tableau <- count2("Tableau","tableau", Tableau)
D3.js <- count2("D3.js","d3.js", D3.js)
SAS <- count2("SAS","sas", SAS)
SPSS <- count2("SPSS","spss", SPSS)
D3 <- count2("d3", "D3", D3)
Hadoop <- count2("Hadoop", "hadoop", Hadoop)
MapReduce <- count2("MapReduce","mapreduce", MapReduce)
Spark <- count2("Spark", "spark", Spark)
Pig <- count2("Pig","pig", Pig)
Hive <- count2("Hive", "hive", Hive)
Shark <- count2("Shark", "shark", Shark)
Oozie <- count2("Oozie", "oozie", Oozie)
Zookeeper <- count2("Zookeeper", "zookeeper", Zookeeper)
Flume <- count2("Flume", "flume", Flume)
Mahout <- count2("Mahout", "mahout", Mahout)
SQL <- count6_fixed6("\\<SQL\\>", "\\<sql\\>","\\<SQL.\\>", "\\<SQL,\\>", "\\<sql.\\>", "\\<sql,\\>", SQL)
NoSQL <- count2("NoSQL", "nosql", NoSQL) 
HBase <- count2("HBase", "hbase", HBase)
Cassandra <- count2("Cassandra", "cassandra", Cassandra)
MongoDB <- count2("MongoDB","mongodb", MongoDB)

MachineLearning <- count2("Machine Learning", "machine learning", MachineLearning)
BigData <- count3("Big Data", "big data", "Big data", BigData)
PhD <- count3("PhD", "Ph.D", "Ph.D.", PhD)
Bachelors <- count7("Undergraduate", "undergraduate", "BA", "BS", "BSc", "Bachelor\'s", "Bachelors", Bachelors)

Masters <- count9_fixed5("\\<Ms\\>", "\\<MS\\>", "\\<M.S.\\>", "\\<MA\\>", "\\<MBA\\>", "MS degree","Master\'s", "Master", "Masters", Masters)

stack_count <- cbind(R, Python, Java, C_plus_plus, Ruby, Perl, Matlab, Javascript, Scala, Excel, Tableau, D3.js, SAS, SPSS, D3, Hadoop, MapReduce, Spark, Pig, Hive, Shark, Oozie, Zookeeper, Flume, Mahout, SQL, NoSQL, HBase, Cassandra, MongoDB, MachineLearning, BigData, PhD, Bachelors, Masters)

names(stack_count) <- c("R", "Python", "Java", "C_plus_plus", "Ruby", "Perl", "Matlab", "Javascript", "Scala", "Excel", "Tableau", "D3.js", "SAS", "SPSS", "D3", "Hadoop", "MapReduce", "Spark", "Pig", "Hive", "Shark", "Oozie", "Zookeeper", "Flume", "Mahout", "SQL", "NoSQL", "HBase", "Cassandra", "MongoDB", "MachineLearning", "BigData", "PhD", "Bachelors", "Masters")

stack_count <- cbind(stackoverflow, stack_count)

Creating sum_count with columns term, count (total number of job ads containing the term) and freq (% of job ads containing the term)

sum_count <- stack_count %>% 
  summarize(R = sum(R), Python = sum(Python), Java= sum(Java), 
            `C++` = sum(C_plus_plus), Ruby= sum(Ruby), Perl= sum(Perl),
            Matlab= sum(Matlab), Javascript= sum(Javascript), Scala= sum(Scala),
            Excel= sum(Excel), Tableau= sum(Tableau), D3.js=sum(D3.js),
            SAS=sum(SAS), SPSS=sum(SPSS), D3=sum(D3), Hadoop=sum(Hadoop),
            MapReduce=sum(MapReduce),Spark=sum(Spark), Pig=sum(Pig), Hive=sum(Hive),
            Shark=sum(Shark), Oozie=sum(Oozie),  Zookeeper=sum(Zookeeper),
            Flume=sum(Flume), Mahout=sum(Mahout), SQL=sum(SQL), NoSQL=sum(NoSQL),
            HBase=sum(HBase), Cassandra=sum(Cassandra),
            MongoDB=sum(MongoDB),MachineLearning=sum(MachineLearning),
            BigData=sum(BigData), PhD=sum(PhD), Bachelors=sum(Bachelors),
            Masters=sum(Masters)) %>% 
  tbl_df() %>% 
  gather(term, count, R:Masters) #count = number of job ads that include the `term`

sum_count <- sum_count %>% mutate(freq = count/362) 
#362 = total number of job ads; freq = % of job ads that contain the `term`
sum_count %>% 
  filter(term != "Masters" & term!="PhD" & term!="Bachelors") %>% 
  arrange(desc(freq)) 
## Source: local data frame [32 x 3]
## 
##               term count       freq
##              (chr) (dbl)      (dbl)
## 1  MachineLearning   159 0.43922652
## 2          BigData    88 0.24309392
## 3                R    75 0.20718232
## 4           Python    65 0.17955801
## 5            Scala    53 0.14640884
## 6           Hadoop    50 0.13812155
## 7              SQL    44 0.12154696
## 8            Excel    42 0.11602210
## 9             Java    35 0.09668508
## 10           Spark    31 0.08563536
## ..             ...   ...        ...
sum_count %>% 
  filter(term == "Masters" | term =="PhD" | term =="Bachelors") %>% 
  arrange(desc(freq))
## Source: local data frame [3 x 3]
## 
##        term count       freq
##       (chr) (dbl)      (dbl)
## 1       PhD    32 0.08839779
## 2   Masters    29 0.08011050
## 3 Bachelors    19 0.05248619
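The long `summarize()` call above spells out all 35 columns by hand. Since every indicator column is 0/1, the same totals can come from `colSums()`; a minimal sketch, assuming `term_cols` is a character vector naming the 35 term columns of `stack_count` (the original also relabels `C_plus_plus` as `C++`, which is omitted here):

```r
#Sketch: build the same term/count/freq table without 35 sum() calls.
#Assumes term_cols <- c("R", "Python", "Java", ...) lists the indicator columns.
totals <- colSums(stack_count[term_cols])        #number of ads mentioning each term
sum_count <- data.frame(term  = names(totals),
                        count = as.numeric(totals)) %>% 
  mutate(freq = count / nrow(stack_count))       #share of ads containing the term
```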

Comparing data science tools in Stack Overflow job ads

sum_count %>% 
  filter(term != "Masters" & term != "PhD" & term != "Bachelors") %>%
  ggplot(aes(term, freq)) +
  geom_point(size = 2) +
  geom_line(aes(group = 1), color = "maroon") + #color set outside aes(): a fixed color, not a mapped legend
  xlab("") +
  ylab("Percentage of appearance in job postings") +
  ggtitle("Percentage of job ads with a data science tool") +
  theme(text = element_text(size = 20), 
        axis.text.x = element_text(angle = 90, vjust = 1), 
        plot.title = element_text(size = rel(1.1)))

For the Stack Overflow job data, R is still ahead of Python by 3%. Hadoop and SQL were roughly equally popular for data scientists. Notably, about 44% of Stack Overflow job ads for data scientists mentioned machine learning! It seems it is becoming important for data scientists to be familiar with machine learning principles and algorithms. Surprisingly, only 24% mentioned Big Data.

Word cloud of Data Science Tools mentioned in Stack Overflow Job Ads

pal2 <- brewer.pal(8,"Dark2")
wordcloud(sum_count$term, sum_count$freq, scale = c(4, 1.5), random.order = FALSE, colors=pal2, rot.per = 0, fixed.asp = FALSE)

Creating a word cloud of Stack Overflow Job Ad Descriptions

Description1 <- Description %>% 
  mutate(Description = gsub("\\[|\\]", "", Description)) %>%                        #stray square brackets
  mutate(Description = gsub("<li>experience|<li>work|<li><span>", "", Description)) %>% 
  mutate(Description = gsub("<p>|</p>|<ul>|</ul>|<li>|</li>", "", Description)) %>% #paragraph & list tags
  mutate(Description = gsub("<strong>|</strong>|<span>|</span>|<br>|<em>", "", Description)) %>% 
  mutate(Description = gsub("<a href|=\"http://www|rel=\"nofollow\"", "", Description)) %>% #link residue
  mutate(Description = gsub("&amp;|&nbsp;|&rsquo", "", Description))                #HTML entities
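An alternative to enumerating tags one by one is a generic tag-stripping pattern. A sketch (note this would also drop tags the chain above doesn't cover, so downstream word counts could shift slightly):

```r
#Generic alternative: strip every HTML tag and the common entities in one pass.
strip_html <- function(x) {
  x <- gsub("<[^>]*>", " ", x)               #any tag, opening or closing
  x <- gsub("&amp;|&nbsp;|&rsquo;?", " ", x) #common entities
  gsub("\\[|\\]", "", x)                     #stray square brackets
}
#Description1 <- Description %>% mutate(Description = strip_html(Description))
```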

  
r_words <- c("right", "youll", "get", "next", "high", "bull", "way", "etc", "didate", 
             "based", "every", "take", "ndash", "real", "core", "key", "day", "able", 
             "van", "time", "used", "highly", "years", "sets", "want", "seeking", 
             "senior", "related", "join", "ability", "new", "teams", "role", "use", 
             "like", "make", "across", "provide", "drive", "help", "work", "working", 
             "including", "questions", "environment", "implement", "developing", 
             "products", "responsible", "need", "closely", "recommendations", "just", 
             "ing", "set", "andor", "company", "apply", "part", "analysing", "analyse", 
             "using", "can", "within", "will", "well", "looking", "position", 
             "one", "also")

review_text <- paste(Description1$Description, collapse=" ")
review_source <- VectorSource(review_text)
corpus <- Corpus(review_source)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeWords, r_words)

dtm <- DocumentTermMatrix(corpus)
dtm2 <- as.matrix(dtm)

frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing=TRUE)
head(frequency)
##       data       team experience   business  scientist   learning 
##       2249        564        382        378        344        326
words <- names(frequency)

wordcloud(words, frequency, scale=c(3.5,.7), min.freq=30, random.order=FALSE, colors=brewer.pal(8, "Dark2"), max.words = 250, rot.per = 0, fixed.asp = FALSE, random.color = TRUE)

Takeaways

So here are my takeaways from this analysis!

You can pick which programming languages to learn depending on which career in the “data science universe” you want to pursue.

Although Python seems to be gaining momentum over R, they have different strengths: R was developed for statisticians (it is easier to apply statistical concepts without much programming experience), while Python is a general-purpose language. So you may want to pick up both, for different purposes.

Companies seem to prefer open-source languages such as R and Python over commercial ones such as Matlab and SAS. Very good news!

It seems that familiarity with machine learning will become more and more important for data scientists.

Currently, east coast cities have more data science jobs than west coast cities. So hooray for the east coasters!

However, Silicon Valley has more jobs for data scientists than for data analysts, and more for data engineers than for data architects. In contrast, east coast cities have more jobs for data analysts than data scientists. East coast cities also have more jobs for data engineers than data architects, but the gap is not as wide as in Silicon Valley.